INTERSPEECH.2020 - Speech Synthesis

Total: 97

#1 Knowledge-and-Data-Driven Amplitude Spectrum Prediction for Hierarchical Neural Vocoders

Authors: Yang Ai ; Zhen-Hua Ling

In our previous work, we proposed a neural vocoder called HiNet, which recovers speech waveforms by hierarchically predicting amplitude and phase spectra from input acoustic features. In HiNet, the amplitude spectrum predictor (ASP) predicts log amplitude spectra (LAS) from the input acoustic features. This paper proposes a novel knowledge-and-data-driven ASP (KDD-ASP) to improve the conventional one. First, acoustic features (i.e., F0 and mel-cepstra) pass through a knowledge-driven LAS recovery module to obtain approximate LAS (ALAS). This module is designed based on the combination of STFT and source-filter theory, in which the source part and the filter part are designed based on the input F0 and mel-cepstra, respectively. Then, the recovered ALAS are processed by a data-driven LAS refinement module consisting of multiple trainable convolutional layers to obtain the final LAS. Experimental results show that the HiNet vocoder using KDD-ASP achieves higher synthetic speech quality than the one using the conventional ASP and than the WaveRNN vocoder on a text-to-speech (TTS) task.

#2 FeatherWave: An Efficient High-Fidelity Neural Vocoder with Multi-Band Linear Prediction

Authors: Qiao Tian ; Zewang Zhang ; Heng Lu ; Ling-Hui Chen ; Shan Liu

In this paper, we propose FeatherWave, yet another variant of the WaveRNN vocoder, combining multi-band signal processing and linear predictive coding. LPCNet, a recently proposed neural vocoder that exploits the linear predictive characteristics of the speech signal within the WaveRNN architecture, can generate high-quality speech faster than real-time on a single CPU core. However, LPCNet is still not efficient enough for online speech generation tasks. To address this issue, we adopt multi-band linear predictive coding for the WaveRNN vocoder. The multi-band method enables the model to generate several speech samples in parallel at each step, which significantly improves the efficiency of speech synthesis. The proposed model with 4 sub-bands needs less than 1.6 GFLOPS for speech generation. In our experiments, it generates 24 kHz high-fidelity audio 9× faster than real-time on a single CPU, which is much faster than the LPCNet vocoder. Furthermore, our subjective listening test shows that FeatherWave generates speech with better quality than LPCNet.
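
A back-of-the-envelope sketch (not from the paper) of why multi-band generation helps: with B sub-bands, each band runs at fs/B, so the autoregressive model emits B samples per sequential step instead of one.

    # Hypothetical illustration of the step-count reduction from multi-band
    # generation; the sample rate and band count follow the abstract.
    fs = 24000       # output sample rate in Hz
    n_bands = 4      # sub-bands used by the proposed model
    seconds = 1.0

    fullband_steps = int(fs * seconds)              # 24000 sequential steps
    multiband_steps = int(fs * seconds) // n_bands  # 6000 steps, 4 samples each
    print(fullband_steps, multiband_steps)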

#3 VocGAN: A High-Fidelity Real-Time Vocoder with a Hierarchically-Nested Adversarial Network

Authors: Jinhyeok Yang ; Junmo Lee ; Youngik Kim ; Hoon-Young Cho ; Injung Kim

We present VocGAN, a novel high-fidelity real-time neural vocoder. A recently developed GAN-based vocoder, MelGAN, produces speech waveforms in real-time, but it often produces a waveform that is insufficient in quality or inconsistent with the acoustic characteristics of the input mel-spectrogram. VocGAN is nearly as fast as MelGAN, but it significantly improves the quality and consistency of the output waveform. VocGAN applies a multi-scale waveform generator and a hierarchically-nested discriminator to learn multiple levels of acoustic properties in a balanced way. It also applies a joint conditional and unconditional objective, which has shown successful results in high-resolution image synthesis. In experiments, VocGAN synthesizes speech waveforms 416.7× faster than real-time on a GTX 1080Ti GPU and 3.24× faster than real-time on a CPU. Compared with MelGAN, it also exhibits significantly improved quality on multiple evaluation metrics, including mean opinion score (MOS), with minimal additional overhead. Additionally, compared with Parallel WaveGAN, another recently developed high-fidelity vocoder, VocGAN is 6.98× faster on a CPU and achieves a higher MOS.

#4 Lightweight LPCNet-Based Neural Vocoder with Tensor Decomposition

Authors: Hiroki Kanagawa ; Yusuke Ijima

This paper proposes a lightweight neural vocoder based on LPCNet. The recently proposed LPCNet exploits linear predictive coding to represent vocal tract characteristics, and can rapidly synthesize high-quality waveforms with fewer parameters than WaveRNN. For even greater speed, it is necessary to shrink the two time-consuming GRUs and the DualFC layer. Although the original work only pruned the weights of the first GRU, there is room for improvement in the other GRU and the DualFC. Accordingly, we use tensor decomposition to reduce these remaining parameters by more than 80%. For the proposed method we demonstrate that 1) it is 1.26 times faster on a CPU, and 2) it matches the naturalness of the original LPCNet both for acoustic features extracted from natural speech and for those predicted by TTS.
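
As a rough illustration of how factorizing recurrent weights shrinks a model, the sketch below uses a truncated SVD as a simpler stand-in for the tensor decomposition applied in the paper; the matrix size is made up and is not the actual LPCNet GRU dimension.

    import numpy as np

    # Low-rank factorization of a recurrent weight matrix (truncated SVD as a
    # stand-in for the paper's tensor decomposition); sizes are illustrative.
    rng = np.random.default_rng(0)
    W = rng.standard_normal((384, 384))              # e.g. one GRU gate matrix

    rank = 32
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W_approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # rank-32 approximation of W

    orig = W.size
    compressed = U[:, :rank].size + Vt[:rank].size + rank
    print(f"{orig} -> {compressed} parameters "
          f"({100 * (1 - compressed / orig):.0f}% fewer)")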

#5 WG-WaveNet: Real-Time High-Fidelity Speech Synthesis Without GPU

Authors: Po-chun Hsu ; Hung-yi Lee

In this paper, we propose WG-WaveNet, a fast, lightweight, and high-quality waveform generation model. WG-WaveNet is composed of a compact flow-based model and a post-filter. The two components are jointly trained by maximizing the likelihood of the training data and optimizing loss functions in the frequency domain. Because the flow-based model is heavily compressed, the proposed model requires far fewer computational resources than other waveform generation models during both training and inference; even though the model is highly compressed, the post-filter maintains the quality of the generated waveform. Our PyTorch implementation can be trained using less than 8 GB of GPU memory and generates audio samples at a rate of more than 960 kHz on an NVIDIA 1080Ti GPU. Furthermore, even when synthesizing on a CPU, we show that the proposed method is capable of generating 44.1 kHz speech waveforms 1.2 times faster than real-time. Experiments also show that the quality of the generated audio is comparable to that of other methods. Audio samples are publicly available online.

#6 What the Future Brings: Investigating the Impact of Lookahead for Incremental Neural TTS

Authors: Brooke Stephenson ; Laurent Besacier ; Laurent Girin ; Thomas Hueber

In incremental text-to-speech synthesis (iTTS), the synthesizer produces an audio output before it has access to the entire input sentence. In this paper, we study the behavior of a neural sequence-to-sequence TTS system when used in an incremental mode, i.e., when generating speech output for token n, the system has access to n+k tokens from the text sequence. We first analyze the impact of this incremental policy on the evolution of the encoder representations of token n for different values of k (the lookahead parameter). The results show that, on average, tokens travel 88% of the way to their full-context representation with a one-word lookahead and 94% after two words. We then investigate which text features are the most influential on the evolution towards the final representation using a random forest analysis. The results show that the most salient factors are related to token length. We finally evaluate the effect of the lookahead k at the decoder level using a MUSHRA listening test. This test shows results that contrast with the above high figures: the speech synthesis quality obtained with a two-word lookahead is significantly lower than that obtained with the full sentence.
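
A minimal sketch of the incremental policy described above, with placeholder encode/synthesize functions (not the authors' system): the output for token n is produced from only the first n + k tokens.

    # Minimal sketch of incremental synthesis with a k-token lookahead.
    # `encode` and `synthesize_token` are hypothetical placeholders.
    def incremental_tts(tokens, k, encode, synthesize_token):
        outputs = []
        for n in range(len(tokens)):
            visible = tokens[: n + k + 1]       # token n plus k tokens of lookahead
            states = encode(visible)            # partial-context encoder states
            outputs.append(synthesize_token(states, n))
        return outputs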

#7 Fast and Lightweight On-Device TTS with Tacotron2 and LPCNet

Authors: Vadim Popov ; Stanislav Kamenev ; Mikhail Kudinov ; Sergey Repyevsky ; Tasnima Sadekova ; Vitalii Bushaev ; Vladimir Kryzhanovskiy ; Denis Parkhomenko

We present a fast and lightweight on-device text-to-speech system based on state-of-the-art methods of feature and speech generation, namely Tacotron2 and LPCNet. We show that modifying the basic pipeline, combined with hardware-specific optimizations and extensive use of parallelization, enables running a TTS service even on low-end devices with faster-than-real-time waveform generation. Moreover, the system preserves high speech quality without noticeable degradation in Mean Opinion Score compared to the non-optimized baseline. While the system is mostly oriented toward low-to-mid-range hardware, we believe that it can also be used in any CPU-based environment.

#8 Efficient WaveGlow: An Improved WaveGlow Vocoder with Enhanced Speed

Authors: Wei Song ; Guanghui Xu ; Zhengchen Zhang ; Chao Zhang ; Xiaodong He ; Bowen Zhou

Neural vocoders, such as WaveGlow, have become an important component of recent high-quality text-to-speech (TTS) systems. In this paper, we propose Efficient WaveGlow (EWG), a flow-based generative model serving as an efficient neural vocoder. Similar to WaveGlow, EWG has a normalizing-flow backbone where each flow step consists of an affine coupling layer and an invertible 1×1 convolution. To reduce the number of model parameters and enhance the speed without sacrificing the quality of the synthesized speech, EWG improves WaveGlow in three aspects. First, the WaveNet-style transform network in WaveGlow is replaced with an FFTNet-style dilated convolution network. Next, to reduce the computation cost, group convolution is applied to both the audio and the local condition features. Finally, the local condition is shared among the transform network layers in each coupling layer. As a result, EWG reduces both the number of floating-point operations (FLOPs) required to generate one second of audio and the number of model parameters by more than a factor of 12. Experimental results show that EWG reduces real-world inference time by more than a factor of two, without any obvious reduction in speech quality.
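
To make the group-convolution saving concrete, here is a small PyTorch comparison (channel sizes are illustrative, not the actual EWG configuration): splitting the channels into g groups divides the convolution's weight count by roughly g.

    import torch.nn as nn

    # Dense vs. grouped 1-D convolution with the same channel counts.
    dense = nn.Conv1d(256, 256, kernel_size=3, padding=1)
    grouped = nn.Conv1d(256, 256, kernel_size=3, padding=1, groups=8)

    count = lambda m: sum(p.numel() for p in m.parameters())
    print(count(dense), count(grouped))  # the grouped layer has ~8x fewer conv weights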

#9 Can Auditory Nerve Models Tell us What’s Different About WaveNet Vocoded Speech?

Authors: Sébastien Le Maguer ; Naomi Harte

Nowadays, synthetic speech is almost indistinguishable from human speech. The remarkable quality is mainly due to the displacement of signal-processing-based vocoders in favour of neural vocoders, in particular the WaveNet architecture. At the same time, speech synthesis evaluation is still struggling to adjust to these improvements. These difficulties are even more prevalent for objective evaluation methodologies, which do not correlate well with human perception. Yet, an often forgotten use of objective evaluation is to uncover prominent differences between speech signals. Such differences are crucial to decipher the improvement introduced by the use of WaveNet. Therefore, abandoning objective evaluation could be a serious mistake. In this paper, we analyze vocoded synthetic speech re-rendered using WaveNet, comparing it to standard vocoded speech. To do so, we objectively compare spectrograms and neurograms, the latter being the output of auditory nerve (AN) models. The spectrograms allow us to look at the speech production side, while the neurograms relate to the speech perception path. While we were not yet able to pinpoint how WaveNet and WORLD differ, our results suggest that the Mean-Rate (MR) neurograms in particular warrant further investigation.

#10 Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions

Authors: Dipjyoti Paul ; Yannis Pantazis ; Yannis Stylianou

Recent advancements in deep learning have led to human-level performance in single-speaker speech synthesis. However, there are still limitations in terms of speech quality when generalizing those systems to multi-speaker models, especially for unseen speakers and unseen recording qualities. For instance, conventional neural vocoders are adjusted to the training speaker and have poor generalization capabilities to unseen speakers. In this work, we propose a variant of WaveRNN, referred to as speaker conditional WaveRNN (SC-WaveRNN), targeting the development of an efficient universal vocoder even for unseen speakers and recording conditions. In contrast to standard WaveRNN, SC-WaveRNN exploits additional information given in the form of speaker embeddings. Using publicly available data for training, SC-WaveRNN achieves significantly better performance than baseline WaveRNN on both subjective and objective metrics. In terms of MOS, SC-WaveRNN achieves an improvement of about 23% for seen speakers and seen recording conditions, and up to 95% for unseen speakers and unseen conditions. Finally, we extend our work by implementing multi-speaker text-to-speech (TTS) synthesis similar to zero-shot speaker adaptation. In terms of performance, our system was preferred over the baseline TTS system by 60% vs. 15.5% for seen speakers and by 60.9% vs. 32.6% for unseen speakers.
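
A minimal sketch of the speaker conditioning described above, assuming the embedding is simply broadcast over time and concatenated with the frame-level acoustic features before entering the vocoder (dimensions are made up):

    import torch

    mel = torch.randn(1, 200, 80)    # (batch, frames, mel bins)
    spk = torch.randn(1, 256)        # utterance-level speaker embedding

    # Tile the speaker embedding across frames and concatenate with the features.
    spk_tiled = spk.unsqueeze(1).expand(-1, mel.size(1), -1)   # (1, 200, 256)
    conditioning = torch.cat([mel, spk_tiled], dim=-1)         # (1, 200, 336)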

#11 Neural Homomorphic Vocoder

Authors: Zhijun Liu ; Kuan Chen ; Kai Yu

In this paper, we propose the neural homomorphic vocoder (NHV), a source-filter-model-based neural vocoder framework. NHV synthesizes speech by filtering impulse trains and noise with linear time-varying (LTV) filters. A neural network controls the LTV filters by estimating the complex cepstra of time-varying impulse responses given acoustic features. The proposed framework can be trained with a combination of multi-resolution STFT loss and adversarial loss functions. Thanks to the use of DSP-based synthesis methods, NHV is highly efficient, fully controllable, and interpretable. A vocoder was built under the framework to synthesize speech given log-Mel spectrograms and fundamental frequencies. While the model costs only 15 kFLOPs per sample, its synthesis quality remains comparable to baseline neural vocoders in both copy synthesis and text-to-speech.
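
For reference, a generic multi-resolution STFT loss of the kind mentioned above might look as follows (spectral-convergence plus log-magnitude terms); the resolutions and weighting used in NHV may differ.

    import torch
    import torch.nn.functional as F

    def multi_resolution_stft_loss(x, y,
                                   resolutions=((512, 128), (1024, 256), (2048, 512))):
        """Compare waveforms x and y at several STFT resolutions."""
        loss = 0.0
        for n_fft, hop in resolutions:
            window = torch.hann_window(n_fft, device=x.device)
            X = torch.stft(x, n_fft, hop, window=window, return_complex=True).abs()
            Y = torch.stft(y, n_fft, hop, window=window, return_complex=True).abs()
            sc = torch.norm(Y - X, p="fro") / torch.norm(Y, p="fro")   # spectral convergence
            mag = F.l1_loss(torch.log(X + 1e-7), torch.log(Y + 1e-7))  # log-magnitude distance
            loss = loss + sc + mag
        return loss / len(resolutions)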

#12 g2pM: A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset

Authors: Kyubyong Park ; Seanie Lee

Conversion of Chinese graphemes to phonemes (G2P) is an essential component of Mandarin Chinese text-to-speech (TTS) systems. One of the biggest challenges in Chinese G2P conversion is disambiguating the pronunciation of polyphones, i.e., characters that have multiple pronunciations. Although many academic efforts have been made to address it, to date there has been no open dataset that can serve as a standard benchmark for fair comparison. In addition, most of the reported systems are hard to employ for researchers or practitioners who want to convert Chinese text into pinyin at their convenience. Motivated by this, we introduce a new benchmark dataset consisting of 99,000+ sentences for Chinese polyphone disambiguation. We train a simple Bi-LSTM model on it and find that it outperforms other pre-existing G2P systems and only slightly underperforms pre-trained Chinese BERT. Finally, we package our project and share it on PyPI.

#13 A Mask-Based Model for Mandarin Chinese Polyphone Disambiguation

Authors: Haiteng Zhang ; Huashan Pan ; Xiulin Li

Polyphone disambiguation is an essential part of a Mandarin text-to-speech (TTS) system. However, conventional systems that model the entire Pinyin set can predict a pronunciation that belongs to an unrelated polyphonic character instead of the current input one, which has a negative impact on TTS performance. To address this issue, we introduce a mask-based model for polyphone disambiguation. The model takes a mask vector extracted from the context as an extra input. In our model, the mask vector not only acts as a weighting factor in a Weighted-softmax to prevent such mis-predictions, but also eliminates the contribution of the non-candidate set to the overall loss. Moreover, to mitigate the uneven distribution of pronunciations, we introduce a new loss called Modified Focal Loss. The experimental results show the effectiveness of the proposed mask-based model. We also empirically studied the impact of the Weighted-softmax and the Modified Focal Loss, and found that the Weighted-softmax effectively prevents the model from predicting outside the candidate set, while the Modified Focal Loss reduces the adverse impact of the uneven distribution of pronunciations.
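
The masking idea can be sketched as follows (inventory size and candidate indices are made up): logits for pronunciations outside the current character's candidate set are pushed to negative infinity before the softmax, so they receive zero probability and contribute nothing to the loss.

    import torch

    logits = torch.randn(1, 1500)     # scores over the full Pinyin set
    mask = torch.zeros(1, 1500)
    mask[0, [37, 512]] = 1.0          # the current character's candidate pronunciations

    masked_logits = logits.masked_fill(mask == 0, float("-inf"))
    probs = torch.softmax(masked_logits, dim=-1)   # non-candidates get exactly zero probability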

#14 Perception of Concatenative vs. Neural Text-To-Speech (TTS): Differences in Intelligibility in Noise and Language Attitudes

Authors: Michelle Cohn ; Georgia Zellou

This study tests speech-in-noise perception and social ratings of speech produced by different text-to-speech (TTS) synthesis methods. We used identical speaker training datasets for a set of 4 voices (using AWS Polly TTS), generated using neural and concatenative TTS. In Experiment 1, listeners identified target words in semantically predictable and unpredictable sentences in concatenative and neural TTS at two noise levels (-3 dB, -6 dB SNR). Correct word identification was lower for neural TTS than for concatenative TTS, in the lower SNR, and for semantically unpredictable sentences. In Experiment 2, listeners rated the voices on 4 social attributes. Neural TTS was rated as more human-like, natural, likeable, and familiar than concatenative TTS. Furthermore, how natural listeners rated the neural TTS voice was positively related to their speech-in-noise accuracy. Together, these findings show that the TTS method influences both intelligibility and social judgments of speech — and that these patterns are linked. Overall, this work contributes to our understanding of the nexus of speech technology and human speech perception.

#15 Enhancing Sequence-to-Sequence Text-to-Speech with Morphology

Authors: Jason Taylor ; Korin Richmond

Neural sequence-to-sequence (S2S) modelling encodes a single, unified representation for each input sequence. When used for text-to-speech synthesis (TTS), such representations must embed the ambiguities between English spelling and pronunciation. For example, the character sequence th sounds different in pothole and there. This can be problematic when predicting pronunciation directly from letters. We posit that pronunciation becomes easier to predict when letters are grouped into sub-word units like morphemes (e.g., a boundary lies between t and h in pothole but not in there). Moreover, morphological boundaries can reduce the total number of, and increase the counts of, seen unit subsequences. Accordingly, we test the effect of augmenting input sequences of letters with morphological boundaries. We find that morphological boundaries substantially lower the Word and Phone Error Rates (WER and PER) of a Bi-LSTM performing G2P, and also increase the naturalness scores of Tacotron models performing TTS in a MUSHRA listening test. The improvements to TTS quality are such that grapheme input augmented with morphological boundaries outperforms phone input without boundaries. Since morphological segmentation can be predicted with high accuracy, we highlight that this simple pre-processing step has important potential for S2S modelling in TTS.
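
A toy illustration of the input augmentation (the boundary symbol and segmentations are illustrative, not the paper's exact scheme): a boundary token inserted between morphs exposes the t|h split in pothole that there does not have.

    # Insert a boundary symbol between morphs before feeding letters to G2P/TTS.
    def with_boundaries(morphs, boundary="|"):
        return list(boundary.join(morphs))

    print(with_boundaries(["pot", "hole"]))  # ['p', 'o', 't', '|', 'h', 'o', 'l', 'e']
    print(with_boundaries(["there"]))        # ['t', 'h', 'e', 'r', 'e']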

#16 Deep MOS Predictor for Synthetic Speech Using Cluster-Based Modeling

Authors: Yeunju Choi ; Youngmoon Jung ; Hoirin Kim

While deep learning has made impressive progress in speech synthesis and voice conversion, the assessment of the synthesized speech is still carried out by human participants. Several recent papers have proposed deep-learning-based assessment models and shown the potential to automate the speech quality assessment. To improve the previously proposed assessment model, MOSNet, we propose three models using cluster-based modeling methods: using a global quality token (GQT) layer, using an Encoding Layer, and using both of them. We perform experiments using the evaluation results of the Voice Conversion Challenge 2018 to predict the mean opinion score of synthesized speech and similarity score between synthesized speech and reference speech. The results show that the GQT layer helps to predict human assessment better by automatically learning the useful quality tokens for the task and that the Encoding Layer helps to utilize frame-level scores more precisely.

#17 Deep Learning Based Assessment of Synthetic Speech Naturalness

Authors: Gabriel Mittag ; Sebastian Möller

In this paper, we present a new objective prediction model for synthetic speech naturalness. It can be used to evaluate text-to-speech or voice conversion systems and works language-independently. The model is trained end-to-end and is based on a CNN-LSTM network that has previously been shown to give good results for speech quality estimation. We trained and tested the model on 16 different datasets, including data from the Blizzard Challenge and the Voice Conversion Challenge. Further, we show that the reliability of deep-learning-based naturalness prediction can be improved by transfer learning from speech quality prediction models that are trained on objective POLQA scores. The proposed model is made publicly available and can, for example, be used to evaluate different TTS system configurations.

#18 Distant Supervision for Polyphone Disambiguation in Mandarin Chinese

Authors: Jiawen Zhang ; Yuanyuan Zhao ; Jiaqi Zhu ; Jinba Xiao

Grapheme-to-phoneme (G2P) conversion plays an important role in building a Mandarin Chinese text-to-speech (TTS) system, in which polyphone disambiguation is an indispensable task. However, most previous polyphone disambiguation models are trained on manually annotated datasets, which suffer from data scarcity, narrow coverage, and unbalanced data distributions. In this paper, we propose a framework that can predict the pronunciations of Chinese characters, whose core model is trained in a distantly supervised way. Specifically, we utilize the alignment procedure used for acoustic models to produce abundant character-phoneme sequence pairs, which are employed to train a Seq2Seq model with an attention mechanism. We also make use of a language model trained on phoneme sequences to alleviate the impact of noise in the auto-generated dataset. Experimental results demonstrate that even without additional syntactic features and pre-trained embeddings, our approach achieves competitive prediction results, and in particular improves the predictive accuracy for unbalanced polyphonic characters. In addition, compared with the manually annotated training datasets, the auto-generated one is more diverse and makes the results more consistent with the pronunciation habits of most people.

#19 An Unsupervised Method to Select a Speaker Subset from Large Multi-Speaker Speech Synthesis Datasets

Authors: Pilar Oplustil Gallegos ; Jennifer Williams ; Joanna Rownicka ; Simon King

Large multi-speaker datasets for TTS typically contain diverse speakers, recording conditions, styles and quality of data. Although one might generally presume that more data is better, in this paper we show that a model trained on a carefully-chosen subset of speakers from LibriTTS provides significantly better quality synthetic speech than a model trained on a larger set. We propose an unsupervised methodology to find this subset by clustering per-speaker acoustic representations.
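
A hypothetical sketch of the clustering step: average each speaker's utterance-level acoustic embeddings into a single vector, cluster the speakers, and keep the speakers from the cluster judged best for training. The embedding source, cluster count, and selection rule here are placeholders, not the paper's exact recipe.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    speaker_vecs = rng.standard_normal((1000, 128))   # one averaged embedding per speaker

    labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(speaker_vecs)
    chosen_cluster = 3                                 # e.g. picked via a small listening test
    subset = np.flatnonzero(labels == chosen_cluster)  # speaker indices kept for TTS training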

#20 Understanding the Effect of Voice Quality and Accent on Talker Similarity

Authors: Anurag Das ; Guanlong Zhao ; John Levis ; Evgeny Chukharev-Hudilainen ; Ricardo Gutierrez-Osuna

This paper presents a methodology to study the role of non-native accents in talker recognition by humans. The methodology combines a state-of-the-art accent-conversion system, used to resynthesize the voice of a speaker with an accent different from her/his own, and a protocol for perceptual listening tests to measure the relative contributions of accent and voice quality to speaker similarity. Using a corpus of non-native and native speakers, we generated accent conversions in two different directions: non-native speakers with native accents, and native speakers with non-native accents. We then asked listeners to rate the similarity between 50 pairs of real or synthesized speakers. Using a linear mixed-effects model, we find that (for our corpus) the effect of voice quality is five times as large as that of non-native accent, and that the effect disappears when speakers share the same (native) accent. We discuss the potential significance of this work for earwitness identification and sociophonetics.

#21 Using Cyclic Noise as the Source Signal for Neural Source-Filter-Based Speech Waveform Model

Authors: Xin Wang ; Junichi Yamagishi

Neural source-filter (NSF) waveform models generate speech waveforms by morphing sine-based source signals through dilated convolutions in the time domain. Although the sine-based source signals help the NSF models produce voiced sounds at the specified pitch, the sine shape may constrain the generated waveform when the target voiced sounds are less periodic. In this paper, we propose a more flexible source signal called cyclic noise, a quasi-periodic noise sequence given by the convolution of a pulse train and static random noise with a trainable decay rate that controls the signal shape. We further propose a masked spectral loss to guide the NSF models to produce periodic voiced sounds from the cyclic-noise-based source signal. Results from a large-scale listening test demonstrated the effectiveness of the cyclic noise and the masked spectral loss on speaker-independent NSF models in copy-synthesis experiments on the CMU ARCTIC database.
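
A rough numpy sketch of the cyclic noise construction: a pulse train is convolved with a short, exponentially decaying noise segment. The decay rate is trainable in the paper; here it is a fixed constant, and the pulse train uses a constant F0 for simplicity.

    import numpy as np

    fs, f0, dur = 16000, 120.0, 0.5
    n = int(fs * dur)

    pulses = np.zeros(n)
    pulses[::int(fs / f0)] = 1.0                   # constant-F0 pulse train

    decay = np.exp(-np.arange(256) / 64.0)         # exponentially decaying envelope
    noise_kernel = decay * np.random.randn(256)    # decaying random noise segment

    cyclic_noise = np.convolve(pulses, noise_kernel)[:n]   # quasi-periodic source signal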

#22 Unconditional Audio Generation with Generative Adversarial Networks and Cycle Regularization

Authors: Jen-Yu Liu ; Yu-Hua Chen ; Yin-Cheng Yeh ; Yi-Hsuan Yang

In a recent paper, we presented a generative adversarial network (GAN)-based model for unconditional generation of the mel-spectrograms of singing voices. As the generator of the model is designed to take a variable-length sequence of noise vectors as input, it can generate mel-spectrograms of variable length. However, our previous listening test showed that the quality of the generated audio leaves room for improvement. The present paper extends that previous work in the following aspects. First, we employ a hierarchical architecture in the generator to induce some structure in the temporal dimension. Second, we introduce a cycle regularization mechanism to the generator to avoid mode collapse. Third, we evaluate the performance of the new model not only for generating singing voices, but also for generating speech. Evaluation results show that the new model outperforms the prior one both objectively and subjectively. We also employ the model to unconditionally generate sequences of piano and violin music and find the results promising. Audio examples, as well as the code for implementing our model, will be publicly available online upon paper publication.

#23 Complex-Valued Variational Autoencoder: A Novel Deep Generative Model for Direct Representation of Complex Spectra

Author: Toru Nakashika

In recent years, variational autoencoders (VAEs) have attracted interest for many applications and generative tasks. Although the VAE is one of the most powerful deep generative models, it still has difficulty representing complex-valued data such as the complex spectra of speech. In speech synthesis, the VAE is usually used to encode mel-cepstra or raw amplitude spectra of a speech signal into normally distributed latent features, and the speech is then synthesized from the reconstruction using the Griffin-Lim algorithm or other vocoders. Such inputs are originally calculated from complex spectra but lack the phase information, which leads to degradation when recovering speech. In this work, we propose a novel generative model that directly encodes complex spectra by extending the conventional VAE. The proposed model, which we call the complex-valued VAE (CVAE), consists of two complex-valued neural networks (CVNNs): an encoder and a decoder. In the CVAE, not only the inputs and the parameters of the encoder and decoder but also the latent features are complex-valued, so that the phase information is preserved throughout the network. The results of our speech encoding experiments demonstrate the effectiveness of the CVAE compared to the conventional VAE in both objective and subjective criteria.
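
As a toy illustration of "complex in, complex out", the snippet below applies a complex-valued affine transform to one STFT frame using PyTorch's native complex tensors; the paper's CVNN layers and activations are more elaborate, and the sizes here are made up.

    import torch

    W = torch.randn(64, 257, dtype=torch.complex64)     # complex-valued weights
    b = torch.randn(64, dtype=torch.complex64)          # complex-valued bias

    spectrum = torch.randn(257, dtype=torch.complex64)  # one complex STFT frame
    hidden = W @ spectrum + b                           # complex-valued latent features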

#24 Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding

Authors: Seungwoo Choi ; Seungju Han ; Dongyoung Kim ; Sungjoo Ha

On account of the growing demand for personalization, the need for so-called few-shot TTS systems that clone a speaker's voice from only a few samples is emerging. To address this issue, we propose Attentron, a few-shot TTS model that clones the voices of speakers unseen during training. It introduces two special encoders, each serving a different purpose: a fine-grained encoder extracts variable-length style information via an attention mechanism, and a coarse-grained encoder greatly stabilizes speech synthesis, avoiding unintelligible gibberish even when synthesizing speech for unseen speakers. In addition, the model can scale to an arbitrary number of reference audios to improve the quality of the synthesized speech. According to our experiments, including a human evaluation, the proposed model significantly outperforms state-of-the-art models in terms of speaker similarity and quality when generating speech for unseen speakers.

#25 Reformer-TTS: Neural Speech Synthesis with Reformer Network

Authors: Hyeong Rae Ihm ; Joun Yeop Lee ; Byoung Jin Choi ; Sung Jun Cheon ; Nam Soo Kim

Recent end-to-end text-to-speech (TTS) systems based on deep neural networks (DNNs) have shown state-of-the-art performance in the field of speech synthesis. In particular, attention-based sequence-to-sequence models have successfully improved the quality of the alignment between text and spectrogram. Leveraging this improvement, speech synthesis using a Transformer network has been reported to generate human-like speech audio. However, such sequence-to-sequence models require intensive computing power and memory during training: the attention scores are calculated over the entire key sequence for every query position, which increases memory usage. To mitigate this issue, we propose Reformer-TTS, a model using a Reformer network, which employs locality-sensitive hashing attention and a reversible residual network. We show that the Reformer network consumes roughly half the memory of the Transformer, which leads to fast convergence when training an end-to-end TTS system. We demonstrate these advantages with memory usage as well as objective and subjective performance evaluations.